Sequence-Level Knowledge Distillation
Abstract
Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and we also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied to the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with a decrease of only 0.2 BLEU. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search.
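The word-level distillation mentioned in the abstract trains the student to match the teacher's per-position output distribution rather than only the hard reference tokens. A minimal sketch of that loss, in plain Python and assuming per-position logit lists for illustration (the function names and the temperature parameter are not from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def word_level_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft distribution and the
    student's distribution, summed over positions in the sequence.

    student_logits, teacher_logits: lists of per-position logit lists,
    one entry per target position, each over the same vocabulary.
    """
    loss = 0.0
    for s_pos, t_pos in zip(student_logits, teacher_logits):
        q = softmax(t_pos, temperature)  # teacher soft targets
        p = softmax(s_pos, temperature)  # student predictions
        loss += -sum(qi * math.log(pi) for qi, pi in zip(q, p))
    return loss
```

When the student matches the teacher exactly, the loss reduces to the teacher's entropy; any divergence increases it, which is what drives the student toward the teacher's predictive distribution. The sequence-level variants in the paper instead train the student on whole output sequences produced by the teacher, which this per-position sketch does not cover.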